In this project, you will analyze a dataset containing annual spending amounts across several product categories, in order to understand the internal structure of the data and the variation among the different types of customers that a wholesale distributor interacts with.
Instructions:
In [1]:
# Import libraries: NumPy, pandas, matplotlib
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
sns.set()
# Read dataset
data = pd.read_csv("wholesale-customers.csv")
print "Dataset has {} rows, {} columns".format(*data.shape)
data.head() # print the first 5 rows
Out[1]:
In [2]:
# Visualize pairwise relationships between features
sns.pairplot(data, size=2)
Out[2]:
In [3]:
data.describe()
Out[3]:
In [4]:
#Correlation Table
data.corr()
Out[4]:
1) In this section you will be using PCA and ICA to start to understand the structure of the data. Before doing any computations, what do you think will show up in your computations? List one or two ideas for what might show up as the first PCA dimensions, or what type of vectors will show up as ICA dimensions.
Answer:
PCA is used for two purposes: to examine the correlation between the features of the data, and to perform feature reduction. It produces a low-dimensional representation of a dataset by finding a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated.
PCA finds vectors on which the projected data has maximum variance; ICA, on the other hand, finds vectors on which the projected data is statistically independent. More practically, PCA helps when you want a reduced-rank representation of your data, while ICA helps when you want to represent your data as a mix of independent sub-elements. In layman's terms, PCA helps to compress data and ICA helps to separate data.
PCA will create new features that minimize the information loss.
Judging by the standard deviations, 'Fresh' and 'Grocery' should have the biggest impact on the first principal components and should best differentiate between customer segments.
Looking at the correlation table, 'Grocery', 'Milk' and 'Detergents_Paper' are highly correlated with each other, so we could expect a new composite feature that takes in aspects of all three.
For ICA, all features are used: the result will be a 6x6 unmixing matrix, since we want one independent component per feature. ICA is not used for reduction; it selects components so that each carries a maximum amount of independent information.
Because we use all 6 features, each vector will most probably correspond to a distinct independent pattern of purchases.
In ICA, the basis you want to find is one in which each vector is an independent component of your data; you can think of the data as a mix of signals, and the ICA basis will then have one vector per independent signal.
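As a small aside (not part of the project data), the sketch below illustrates this PCA/ICA contrast on two artificially mixed, independent signals; all names in it are local to the example.
# Minimal PCA-vs-ICA illustration on synthetic mixed signals (not the wholesale data)
from sklearn.decomposition import PCA, FastICA
rng = np.random.RandomState(0)
s1 = np.sign(np.sin(np.linspace(0, 8 * np.pi, 1000)))   # square-like wave
s2 = rng.laplace(size=1000)                             # non-Gaussian noise
S = np.c_[s1, s2]                                       # true independent sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])                  # mixing matrix
X = S.dot(A.T)                                          # observed mixtures
print PCA(n_components=2).fit(X).components_            # orthogonal max-variance directions
print FastICA(n_components=2, random_state=0).fit(X).components_  # unmixing rows recover the sources (up to scale/order)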
In [5]:
# TODO: Apply PCA with the same number of dimensions as variables in the dataset
from sklearn.decomposition import PCA
pca = PCA(n_components=6).fit(data)
# Print the components and the amount of variance in the data contained in each dimension
print pca.components_
print "\n"
print pca.explained_variance_ratio_
In [6]:
plt.plot(np.cumsum(pca.explained_variance_ratio_), '-')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
2) How quickly does the variance drop off by dimension? If you were to use PCA on this dataset, how many dimensions would you choose for your analysis? Why?
Answer:
As per 'explained_variance_ratio_', the variance drops sharply after the second component (from about 0.40 to 0.07) and then falls off slowly.
The goal is to minimize information loss while retaining as much of the variance as possible.
There are two common ways to choose the number of PCA components: keep enough dimensions to retain more than 95% of the variance, or use the elbow method (pick the point on the graph where the variance drops off).
By the first method, the first three components together retain the majority of the information (approximately 93%). By the second method, the graph shows the first two components retain about 86% of the information, which is also reasonable.
Since we also want to reduce the number of features, I think two dimensions will work.
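As a quick cross-check, the cutoff can also be computed directly from the fitted pca object above; the 95% threshold here is just the illustrative figure from the answer, not a project requirement.
# Pick the number of components needed to retain a chosen fraction of the variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = np.argmax(cum_var >= 0.95) + 1
print "Components needed for 95% of the variance:", n_keep
print "Variance retained by the first two components: {:.2f}".format(cum_var[1])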
3) What do the dimensions seem to represent? How can you use this information?
Answer:
The dimensions represent the eigenvectors: the directions in the data along which the variance is greatest.
The eigenvector with the greatest eigenvalue is the first principal component, the eigenvector with the second greatest eigenvalue is the second principal component, and so on.
In this case, the first two components carry the majority of the variance, i.e. most of the information.
The first dimension is driven mostly by 'Fresh', and the second is a combination of 'Grocery', 'Milk' and 'Detergents_Paper'.
Therefore, instead of representing the data in six dimensions, we can use two dimensions and discard the dimensions that give us little information.
This information is important because: First, it can be used to mitigate problems caused by the curse of dimensionality. Second, dimensionality reduction can be used to compress data while minimizing the amount of information that is lost. Third, understanding the structure of data with hundreds of dimensions can be difficult; data with only two or three dimensions can be visualized easily.
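A brief sketch of how these loadings can be read off directly; it reuses the pca and data objects fitted above, and the 'PC1'/'PC2' labels are just illustrative names.
# Inspect which original features dominate the first two principal components
loadings = pd.DataFrame(pca.components_[:2], columns=data.columns, index=['PC1', 'PC2'])
print loadings.T  # large absolute weights mark the dominant features per component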
In [7]:
# TODO: Fit an ICA model to the data
# Note: Adjust the data to have center at the origin first!
from sklearn.decomposition import FastICA
# Standardize the data: zero mean and unit variance
normalized_data = data.copy()
normalized_data -= normalized_data.mean(axis=0)
normalized_data /= normalized_data.std(axis=0)
ica = FastICA().fit(normalized_data)
# Print the independent components
print ica.components_
# Visualize the unmixing matrix as a heatmap
plt.figure(figsize=(15, 5))
sns.heatmap(pd.DataFrame(ica.components_, columns=list(data.columns)), annot=True)
Out[7]:
4) For each vector in the ICA decomposition, write a sentence or two explaining what sort of object or property it corresponds to. What could these components be used for?
Answer:
The independent components are derived by minimizing the mutual information between the recovered signals, thereby separating them.
The components can be interpreted through the unmixing matrix: the larger the absolute value of an element, the stronger the effect of the corresponding feature on that component.
As per the heat map:
'Grocery' and 'Detergents_Paper' have a strong effect on component 4; the weights imply that for every 0.13 units of Detergents_Paper there are 0.12 fewer units of Grocery.
'Milk' and 'Grocery' dominate component 5; for every 0.048 units of Milk there are 0.089 fewer units of Grocery.
For component 3, 'Fresh' has the strongest effect; similarly, 'Milk' dominates component 1, 'Delicatessen' component 6, and 'Frozen' component 2.
In ICA, a negative/positive sign indicates which products are anti-correlated/correlated with each other within a component.
We can interpret the components as customer or buyer types; for example, in the component dominated by 'Frozen', frozen products are the most prevalent purchase.
If 'Milk' and 'Grocery' had the values 10 and -5 respectively, this could imply that for every 10 units of 'Milk' a buyer purchases, they purchase 5 fewer units of 'Grocery'.
This can be used to segment the customers and characterize their behavior: the components describe how a given set of customers buys products.
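A small sketch (reusing the ica and data objects fitted above) that ranks, for each component, the features with the largest absolute unmixing weights; the two-feature cutoff is an arbitrary choice for illustration.
# Rank the dominant features of each independent component by |unmixing weight|
unmixing = pd.DataFrame(ica.components_, columns=data.columns)
for i, row in unmixing.iterrows():
    top = row.abs().sort_values(ascending=False).index[:2]
    print "Component %d is driven mainly by: %s" % (i + 1, ', '.join(top))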
5) What are the advantages of using K Means clustering or Gaussian Mixture Models?
Answer:
Advantages of K Means clustering: it is simple, fast, and scales well to large datasets, and its hard cluster assignments are easy to interpret.
Advantages of Gaussian Mixture Models: it is a soft-clustering method that gives each point a probability of belonging to every cluster, and because each cluster has its own covariance it can model elliptical clusters of different sizes and shapes.
Which one to choose:
GMM works better near the separating boundaries: it assigns probabilities to points in the middle of a decision boundary, which is useful when there is some underlying chance that a customer could belong to another cluster. A short sketch of this soft-vs-hard assignment difference follows.
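Below is a minimal sketch of that difference, fit on a two-dimensional PCA projection of the data; the variable names (X2d, km, gmm) are local to the sketch, and the old sklearn.mixture.GMM API used elsewhere in this notebook is assumed.
# Contrast hard K Means labels with soft GMM membership probabilities
from sklearn.cluster import KMeans
from sklearn.mixture import GMM
X2d = PCA(n_components=2).fit_transform(data)
km = KMeans(n_clusters=2).fit(X2d)
gmm = GMM(n_components=2).fit(X2d)
print km.predict(X2d[:5])         # one hard cluster label per customer
print gmm.predict_proba(X2d[:5])  # probability of belonging to each cluster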
6) Below is some starter code to help you visualize some cluster data. The visualization is based on this demo from the sklearn documentation.
In [8]:
# Import clustering modules
from sklearn.cluster import KMeans
from sklearn.mixture import GMM
In [9]:
# TODO: First we reduce the data to two dimensions using PCA to capture variation
pca=PCA(n_components=2)
reduced_data = pca.fit_transform(data)
print reduced_data[:10]  # print the first 10 rows
reduced_data.shape
Out[9]:
In [10]:
# TODO: Implement your clustering algorithm here, and fit it to the reduced data for visualization
# The visualizer below assumes your clustering object is named 'clusters'
clusters_gmm = []
clusters_kmeans = []
for cl in xrange(2, 6):
    clusters_gmm.append(GMM(n_components=cl).fit(reduced_data))
    print "%d Clusters: " % cl, clusters_gmm[-1], '\n\n'
for cl in xrange(2, 6):
    clusters_kmeans.append(KMeans(n_clusters=cl).fit(reduced_data))
    print "%d Clusters: " % cl, clusters_kmeans[-1], '\n\n'
In [11]:
# Plot the decision boundary by building a mesh grid to populate a graph.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
hx = (x_max-x_min)/1000.
hy = (y_max-y_min)/1000.
xx, yy = np.meshgrid(np.arange(x_min, x_max, hx), np.arange(y_min, y_max, hy))
# Obtain labels for each point in the mesh, for every trained model
Z_gmm = []
Z_kmeans = []
for x in range(4):
    Z_gmm.append(clusters_gmm[x].predict(np.c_[xx.ravel(), yy.ravel()]))
    Z_kmeans.append(clusters_kmeans[x].predict(np.c_[xx.ravel(), yy.ravel()]))
In [12]:
# TODO: Find the centroids for KMeans or the cluster means for GMM
centroids_gmm = []
centroids_kmeans = []
print "GMM\n"
for x in range(2, 6):
    centroids_gmm.append(clusters_gmm[x - 2].means_)
    print "%d clusters: " % x, '\n', centroids_gmm[x - 2], '\n\n'
print "\nKMeans\n"
for x in range(2, 6):
    centroids_kmeans.append(clusters_kmeans[x - 2].cluster_centers_)
    print "%d clusters: " % x, '\n', centroids_kmeans[x - 2], '\n\n'
In [13]:
# Put the result into a color plot
def plot(Z, centroids):
    for i in range(4):
        Z[i] = Z[i].reshape(xx.shape)
        plt.figure(1)
        plt.clf()
        plt.imshow(Z[i], interpolation='nearest',
                   extent=(xx.min(), xx.max(), yy.min(), yy.max()),
                   cmap=plt.cm.Paired,
                   aspect='auto', origin='lower')
        plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
        plt.scatter(centroids[i][:, 0], centroids[i][:, 1],
                    marker='x', s=169, linewidths=3,
                    color='w', zorder=10)
        plt.title('Clustering on the wholesale grocery dataset (PCA-reduced data)\n'
                  'Centroids are marked with white cross')
        plt.xlim(x_min, x_max)
        plt.ylim(y_min, y_max)
        plt.xticks(())
        plt.yticks(())
        plt.show()
In [14]:
print "GMMs \n"
plot(Z_gmm,centroids_gmm)
In [15]:
print "\n KMeans \n"
plot(Z_kmeans,centroids_kmeans)
In [16]:
df = pd.DataFrame(pca.inverse_transform(centroids_gmm[0]),columns=data.columns).T
df.columns= ['Cluster-%i'% x for x in xrange(1,3)]
print df
In [17]:
df = pd.DataFrame(pca.inverse_transform(centroids_gmm[1]),columns=data.columns).T
df.columns= ['Cluster-%i'% x for x in xrange(1,4)]
print df
In [18]:
df = pd.DataFrame(pca.inverse_transform(centroids_kmeans[-1]),columns=data.columns).T
df.columns= ['Cluster-%i'% x for x in xrange(1,6)]
print df
7) What are the central objects in each cluster? Describe them as customers.
Answer:
The central objects, marked with an X, are the average customers in each cluster.
With two clusters, one cluster represents the highest-volume customers and the other the smaller, family-run shops.
After trying different numbers of clusters, three clusters fit this data best.
Inverse transform of the GMM means for 3 clusters:
Cluster 1: spends a lot on Fresh, then on Milk, Grocery, Frozen and Delicatessen (much less than Fresh), and little on Detergents_Paper.
Cluster 2: spends a lot on Fresh, then roughly the same on Milk and Grocery, then on Frozen, Detergents_Paper and Delicatessen.
Cluster 3: spends a lot on Grocery, then on Milk, then on Fresh and Detergents_Paper.
8) Which of these techniques did you feel gave you the most insight into the data?
Answer:
The technique that gave the most insight into the data was PCA combined with Gaussian Mixture Model clustering.
Because the dataset has only a small number of features, PCA was able to derive new components with little information loss; with many more features (say, more than 30), PCA would be even more useful.
For clustering, either KMeans or GMM would be suitable. GMM was preferred because the GMM object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussians models, which helps when there is no clear cutoff between the clusters.
Segmenting the data into subsets of consumers helps marketers implement strategies that target them.
9) How would you use that technique to help the company design new experiments?
Answer:
Now that the segments have been created, the company could retest its delivery change (from a regular morning delivery to a cheaper, bulk evening delivery) separately on the two segments. It is likely that high-volume customers would see no change, while the small family-run shop segment would show some decline in satisfaction.
The company could then design new delivery methods for particular segments and measure the change.
So, using the clustering technique, the company can run experiments on the segments to find the most promising approach, one that does not lose customers or that keeps them happier. The business can also determine which products sell well together and keep them together to increase profit and customer satisfaction.
With two segments we can use A/B testing: within each segment, a control group (current delivery) is compared against a treatment group (new delivery) to determine which performs better. A/B testing uses data and statistics to validate new design changes and techniques; by running controlled tests and gathering empirical data, you can figure out exactly which strategies work best for your company and your product. A sketch of how such a test could be evaluated follows.
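Below is a minimal sketch of evaluating one such test with a two-sample t-test (scipy.stats is imported at the top of the notebook); the satisfaction scores are made-up placeholders, not real results.
# Hypothetical A/B evaluation within one segment (all numbers are made up)
control = np.array([7.1, 6.8, 7.4, 7.0, 6.9, 7.3])    # current morning delivery
treatment = np.array([6.2, 6.5, 6.0, 6.4, 6.6, 6.1])  # cheaper bulk evening delivery
t_stat, p_value = stats.ttest_ind(control, treatment)
print "t = %.2f, p = %.3f" % (t_stat, p_value)
# A small p-value would suggest the delivery change measurably affects this segment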
10) How would you use that data to help you predict future customer needs?
Answer:
The wholesale distributor can use the data to first assign a future customer to one of the clusters (a sketch of this step follows this answer).
Since the clusters have different needs and demands, the analyst could then evaluate new delivery methods for each customer based on that customer's segment and volume.
The company can also run a segmentation analysis to evaluate the sales, profit, and growth of each group. If, say, high-volume customers showed higher sales and profit along with growth potential, the company could strategically choose to focus its efforts on that segment.
In this way, it can reduce the number of complaints about future changes and the risk of losing customers.
It can then use supervised machine learning techniques to model the buying habits of each segment and ensure the necessary inventory and delivery.
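A minimal sketch of that assignment step, reusing the two-component pca and the fitted GMM models from the cells above; the new customer's spending figures are made up for illustration, and clusters_gmm[1] is assumed to be the 3-cluster model from the earlier loop.
# Hypothetical new customer, with columns in the same order as data.columns
# (spending values are made up for illustration)
new_customer = np.array([[12000, 4500, 6000, 1500, 2000, 900]])
projected = pca.transform(new_customer)        # project into the 2-D PCA space
segment = clusters_gmm[1].predict(projected)   # 3-cluster GMM fitted earlier
print "Predicted segment for the new customer:", segment[0] + 1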